MY472 Summative 3 Project
What, if any, characteristics and factors discriminate MPs who tend to ask questions about economic issues from MPs who tend to ask questions about health and welfare issues?
Link to public GitHub repository: https://github.com/lenmetson/my472-final-assignment.
Introduction
Parliamentary questions are a critical part of how MPs represent their constituencies. I therefore ask which characteristics of the constituency an MP represents drive their focus on economic or health and welfare topics in 2023. I operationalise “focus” as the proportion of questions a member asks about a given topic.
Data
I drew on data from two sources: the UK Parliament API (“API”) and the UK House of Commons constituency dashboard (“dashboard”). I limit my analysis to questions asked in 2023. To store the data efficiently, I used a local relational database.
API
First, I pulled the text of questions from the API oral and written question endpoints. For oral questions, the API only returns questions asked in the House of Commons. However, the written question endpoint returns questions from both the House of Lords and the House of Commons. Therefore, I added a parameter to the request URL to only return written questions asked by members of the House of Commons. After flattening and cleaning the responses and adding a variable oral_written to distinguish the question type, I merged both types of question into one dataframe and wrote it to my database as "questions".
For each question, I wanted to pull additional data about the MP who asked it and the minister it was addressed to. Some MP characteristics, such as party affiliation or seat, change over time, so the members table has to accommodate these changes. The API members endpoint allows queries to specify a date and returns the data valid on that date. To make the fewest possible requests, I created unique MP-date pairs from my questions table.
However, because characteristics do not change very frequently, I did not want to write out a table with a row for every MP on every day they asked a question. Instead, I grouped the cleaned response table by each unique combination of MP and characteristics and summarised the earliest and latest date each combination was valid for. In analysis, I can then query the entry that is valid on the date each question was asked. This reduced the number of rows written to my members table from 4225 to 482.
I then used the constituency endpoints to pull the results of the last 4 elections held in each constituency and a shapefile for each constituency.
Dashboard
The data on the demographics of constituencies from the UK Parliament API is very limited. Therefore, I used the Commons Library constituency dashboard to add demographic variables. This data source does not have an API endpoint and requires each constituency to be looked up using a search tool. Therefore, I used Selenium to interactively scrape the data. One limitation of this data source was that data on the median house price in each constituency was not available for constituencies in Scotland and Northern Ireland.
After merging the scraped data with the constituency data pulled from the API, I wrote out the clean dataframe to my database as "constituencies".
Finally, I obtained party names and colours from the API and wrote out the results as the table "parties" for use in plotting.
Measurement
To measure whether a question is about (1) economic issues or (2) health and welfare, I use a simple dictionary approach. Whilst limited compared to machine learning classification approaches, dictionary string matching was more feasible for this project as it does not require expert labelling of a training set.
I wrote out the results of my measurement to a table called question_topics.
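The dictionary-matching step itself is not shown in the code appendix, so the following is a minimal sketch of how such a classifier could look. The term lists and the helper name `classify_question` are illustrative assumptions, not the dictionaries actually used in the project.

```r
# Hypothetical sketch of a dictionary classifier; the term lists below are
# illustrative stand-ins, not the project's actual dictionaries.
library(dplyr)
library(stringr)

econ_terms <- c("econom", "inflation", "tax", "budget", "employment")
health_terms <- c("health", "nhs", "hospital", "welfare", "benefit")

classify_question <- function(text) {
  text <- tolower(text)
  econ <- any(str_detect(text, econ_terms))     # any economic term present?
  health <- any(str_detect(text, health_terms)) # any health/welfare term present?
  if (econ && !health) {
    "economic"
  } else if (health && !econ) {
    "health_welfare"
  } else {
    "other" # neither, or both (ambiguous)
  }
}

# Applied row by row to the questions table:
# question_topics <- question_df %>%
#   rowwise() %>%
#   mutate(topic = classify_question(question_text)) %>%
#   ungroup()
```

Questions matching both dictionaries are treated here as ambiguous; a real implementation would need a rule for such cases.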
Final database
This resulted in 5 tables in my local database:
- questions
- members
- constituencies
- parties
- question_topics
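The analysis-time lookup described above — matching each question to the member record valid on the date it was tabled — can be sketched as a single SQL join against this database. This is a sketch using the table and column names created in the appendix; it is not the project's exact analysis query.

```r
# Sketch: join each question to the member entry whose validity range
# covers the question's tabled date. Dates are stored as ISO-8601 strings,
# so BETWEEN compares them correctly in SQLite.
library(DBI)

qs <- dbGetQuery(db, "
  SELECT q.question_id,
         q.question_tabled_when,
         m.name_display,
         m.latest_party_id
  FROM questions AS q
  LEFT JOIN members AS m
    ON q.asking_member = m.member_id
   AND q.question_tabled_when BETWEEN m.member_date_valid_min
                                  AND m.member_date_valid_max
")
```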
Analysis
For each MP, I compute an economic slant score: the share of their questions about economic issues minus the share about health and welfare issues.
\[ econ\_slant = \frac{N(economic)}{N(questions)} - \frac{N(health\_welfare)}{N(questions)} \]
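This score can be computed directly from the question_topics table. The sketch below assumes that table carries an `asking_member` column and a `topic` column taking the values "economic" and "health_welfare" (an assumption about column names, following the measurement section).

```r
# Sketch: econ_slant per MP from the question_topics table.
# mean(topic == "economic") is the proportion of that MP's questions
# classified as economic, so the difference of the two means matches
# the formula above.
library(dplyr)

econ_slant_df <- question_topics %>%
  group_by(asking_member) %>%
  summarise(
    econ_slant = mean(topic == "economic") - mean(topic == "health_welfare")
  )
```

The score ranges from -1 (all questions on health and welfare) to 1 (all questions on economic issues).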
Code appendix
knitr::opts_chunk$set(
echo = FALSE,
eval = FALSE,
message = FALSE,
warning = FALSE,
error = FALSE)
# Can also be run by sourcing scripts/00_setup.R
# Define function to install or load packages
load_packages <- function(x) {
y <- x %in% rownames(installed.packages())
if(any(!y)) install.packages(x[!y])
invisible(lapply(x, library, character.only = TRUE))
rm(x, y)
}
# Load required packages
load_packages(c(
"tidyverse",
"here",
# Database management
"DBI",
"RSQLite",
# APIs and webscraping
"httr",
"RSelenium",
# Text analysis
"tm",
"XML",
# Geospatial plots
"tmap",
"sf",
"ggrepel",
# Random forests
"parallel",
"ranger",
"tidymodels",
"vip",
"rpart",
"rpart.plot",
# Plotting
"gridExtra"
))
replace_null_with_na <- function(x) {
if (is.list(x)) { # Checks for whether the item is a sublist
lapply(x, replace_null_with_na) # if it is, apply the function for each of the elements within the sublist
} else { # If it isn't, simply apply the main function
if (is.null(x) || identical(x, "null")) "NA" else x # Replace nulls with the string "NA"
}
}
replace_na_chr <- function(df) { # NOTE function adapted from ChatGPT output
df <- df %>%
mutate(across(where(is.character), ~na_if(., "NA")))
return(df)
}
db_table_check <- function(table){
rows <- dbGetQuery(db, paste0("SELECT COUNT(1) FROM ", table))
cols <- dbListFields(db, table)
result = list(
table = table,
n_rows = rows[[1]],
col_names = cols)
return(result)
}
db <- DBI::dbConnect(RSQLite::SQLite(), here("data/parliament_database.sqlite"))
# This code can also be run by sourcing scripts/01_pull-oral-questions.R
GET_qs <- function(endpoint_url, n_skip = 0) {
url <- paste0(
endpoint_url,
"?parameters.skip=",
n_skip,
"&parameters.answeringDateStart=2023-01-01&parameters.answeringDateEnd=2023-12-31", # Limit to 2023
"&parameters.take=100")
response <-
httr::GET(url) %>%
httr::content("parsed") # Use :: because tm masks content
return(response)
}
# Define functions to pull all questions
pull_all_oral_qs <- function(endpoint_url){
# Calculate how many questions are in the end point
n_resp <- httr::GET(paste0(
endpoint_url,
"?parameters.answeringDateStart=2023-01-01&parameters.answeringDateEnd=2023-12-31", # Limit to 2023
"&parameters.take=1")) %>%
httr::content("parsed")
n <- n_resp$PagingInfo$GlobalTotal
# Questions can be pulled in batches of 100,
# calculate how many times we will have to pull
n_loops <- ceiling(n / 100)
print(paste0("LOG | ", Sys.time(), " | Oral question pull starting"))
for (i in 1:n_loops) {
n_skip <- (i - 1) * 100 # Skip however many 100s the loop has run
if (i == 1) { # On first iteration, make new list
response <- GET_qs(endpoint_url, n_skip)
response <- response$Response
} else { # On all other iterations, append to existing list
response_new <- GET_qs(endpoint_url, n_skip)
response_new <- response_new$Response
response <- c(response, response_new) # Merge responses
}
print(paste0("LOG | ", Sys.time(), " | ", i, " of ", n_loops, " done.")) # Print progress message
Sys.sleep(1) # Sleep to avoid hammering the API
}
print(paste0("LOG | ", Sys.time(), " | Oral question pull done :)"))
return(response)
}
## APPLY FUNCTIONS
oral_questions <- pull_all_oral_qs(
"https://oralquestionsandmotions-api.parliament.uk/oralquestions/list")
saveRDS(oral_questions, "data/oral_questions_2023.RDS")
# This code can also be run by sourcing scripts/02_pull-written-questions.R
GET_qs_written <- function(endpoint_url, n_skip = 0) {
url <- paste0(
endpoint_url,
"?skip=",
n_skip,
"&tabledWhenFrom=2023-01-01&tabledWhenTo=2023-12-31", # Limit to 2023
"&house=Commons", # Limit to HoC
"&take=100")
response <-
httr::GET(url) %>%
httr::content("parsed") # Use :: because tm masks content
return(response)
}
pull_all_written_qs <- function(endpoint_url){
n_resp <- httr::GET(
paste0(
endpoint_url,
"?tabledWhenFrom=2023-01-01&tabledWhenTo=2023-12-31", # Limit to 2023
"&house=Commons", # Limit to HoC
"&take=1")) %>%
httr::content("parsed")
n <- n_resp$totalResults
# Questions can be pulled in batches of 100, calculate how many times we will have to pull
n_loops <- ceiling(n/100)
for(i in 1:n_loops){
n_skip <- (i-1)*100 # Skip however many 100s the loop has run
if(i==1){ # On first iteration, make new list
response <- GET_qs_written(endpoint_url, n_skip)
response <- response$results
} else { # On all other iterations, append to existing list
response_new <- GET_qs_written(endpoint_url, n_skip)
response_new <- response_new$results
response <- c(response, response_new) # Merge responses
}
print(paste0("LOG | Written questions | ", Sys.time(), " | ", i, " of ", n_loops, " done.")) # Print progress message
Sys.sleep(0.5) # Sleep to avoid hammering the API
}
print(paste0("LOG | Written questions | ", Sys.time(), " | Written question pull done :)"))
return(response)
}
written_questions <- pull_all_written_qs("https://questions-statements-api.parliament.uk/api/writtenquestions/questions")
saveRDS(written_questions, "data/written_questions_2023.RDS")
oral_questions <- readRDS("data/oral_questions_2023.RDS")
for (i in seq_along(oral_questions)) {
# remove sublists, otherwise names do not match
oral_questions[[i]]$AskingMember <- NULL
oral_questions[[i]][["AnsweringMinister"]] <- NULL
oral_questions[[i]] <- replace_null_with_na(oral_questions[[i]])
if (i == 1){
oral_question_df <- data.frame(oral_questions[[i]])
} else {
oral_question_df2 <- data.frame(oral_questions[[i]])
oral_question_df <- rbind(oral_question_df, oral_question_df2)
}
}
rm(oral_question_df2, i)
### Clean dataframes and merge into one table ####
oral_question_df <- oral_question_df %>%
select(
question_id = Id,
question_text = QuestionText,
asking_member = AskingMemberId,
question_tabled_when = TabledWhen,
question_answering_when = AnsweringWhen,
question_answering_body = AnsweringBody,
question_answering_body_id = AnsweringBodyId,
answering_member = AnsweringMinisterId) %>%
# Ensure variables are the correct class
mutate(
question_id = as.character(question_id),
question_text = as.character(question_text),
asking_member = as.character(asking_member),
question_tabled_when = as.character(as.Date(question_tabled_when)),
question_answering_when = as.character(as.Date(question_answering_when)),
question_answering_body = as.character(question_answering_body),
question_answering_body_id = as.character(question_answering_body_id),
answering_member = as.character(answering_member),
oral_written = "oral") # add written_oral dummy
written_questions <- readRDS("data/written_questions_2023.RDS")
for (i in seq_along(written_questions)) {
# Remove links sublist by keeping only "value"
written_questions[[i]] <- written_questions[[i]]$value
# remove sublists, otherwise names do not match
written_questions[[i]]$groupedQuestions <- NULL
written_questions[[i]]$attachments <- NULL
written_questions[[i]]$groupedQuestionsDates <- NULL
# Replace nulls with NAs
written_questions[[i]] <- replace_null_with_na(written_questions[[i]])
if (i == 1){
written_question_df <- data.frame(written_questions[[i]])
} else {
written_question_df2 <- data.frame(written_questions[[i]])
written_question_df <- rbind(written_question_df, written_question_df2)
}
print(paste0(i, " of ", length(written_questions)))
}
rm(written_question_df2, i)
written_question_df <- written_question_df %>%
select(
question_id = id,
question_text = questionText,
asking_member = askingMemberId,
question_tabled_when = dateTabled,
question_answering_when = dateForAnswer,
question_answering_body = answeringBodyName,
question_answering_body_id = answeringBodyId,
answering_member = answeringMemberId) %>%
# Ensure variables are the correct class
mutate(
question_id = as.character(question_id),
question_text = as.character(question_text),
asking_member = as.character(asking_member),
question_tabled_when = as.character(as.Date(question_tabled_when)),
question_answering_when = as.character(as.Date(question_answering_when)),
question_answering_body = as.character(question_answering_body),
question_answering_body_id = as.character(question_answering_body_id),
answering_member = as.character(answering_member),
oral_written = "written") # add written_oral dummy
question_df <- rbind(oral_question_df, written_question_df)
dbWriteTable(db, "questions", question_df, overwrite = TRUE)
rm(oral_question_df, written_question_df, question_df)
# This code can also be run by sourcing scripts/03_pull-members-endpoint.R
pull_members <- function(base_url, df) {
for (i in seq_along(df$member_id)) {
url <- paste0( # Build request URL
base_url, "/",
df$member_id[i],
"?detailsForDate=",
df$question_tabled_when[i])
if (i == 1) { # If 1st iteration, create response,
response <- httr::GET(url) %>% httr::content("parsed") # Pull request
response <- response[1] # Extract list with response
response <- c(
date = df$question_tabled_when[i], response[[1]]) # Merge with date
response <- list(response) # Convert to list
} else { # else create response2, then merge
response_new <- httr::GET(url) %>% httr::content("parsed")
response_new <- response_new[1]
response_new <- c(
date = df$question_tabled_when[i], response_new[[1]])
response_new <- list(response_new)
response <- c(response, response_new) # Merge responses
}
Sys.sleep(1)
print(paste0("LOG Member Pull | ", Sys.time(), " | ", i, " of ", nrow(df), " done"))
}
return(response)
}
# Query question table to get MP-date pairs
members_asking <- dbGetQuery(db,
"
SELECT
asking_member AS member_id,
question_tabled_when
FROM questions
") %>%
replace_na_chr() # When flattening the API responses, we replaced null values with character "NA" values, this function converts them back to NAs that R can recognise.
ministers_answering <- dbGetQuery(db,
"
SELECT
answering_member AS member_id,
question_tabled_when
FROM questions
") %>%
replace_na_chr() # Convert character "NA" values back to real NAs
q_parameters <- rbind(members_asking, ministers_answering)
# Only keep unique MP-date pairs to avoid pulling the same information twice
q_parameters <- unique(q_parameters) %>%
filter(member_id != 0) # Remove 0s because these indicate no minister has answered
# Apply function to pull members
members <- pull_members(
"https://members-api.parliament.uk/api/Members",
q_parameters)
saveRDS(members, "data/members_raw.Rds")
members <- readRDS("data/members_raw.Rds")
# Replace "null" values with NA so they are kept in the structure of the list
members <- lapply(
members,
function(x) {lapply(x, replace_null_with_na)})
# Convert to dataframe
for (i in seq_along(members)) {
if (i == 1) {
members_df <- members[i] %>%
unlist() %>%
t() %>%
data.frame()
} else {
members_df_new <- members[i] %>%
unlist() %>%
t() %>%
data.frame()
members_df <- rbind(members_df, members_df_new)
}
}
members_df <- members_df %>%
select(
member_date_valid = date,
member_id = id,
name_display = nameDisplayAs,
gender = gender,
latest_constituency = latestHouseMembership.membershipFromId,
latest_party_id = latestParty.id
)
# Some MP characteristics change over time, so we collected unique MP-day queries.
# However, characteristics do not change daily so there is lots of repetition.
# The following code groups MPs by the mutable variables (i.e. unique combinations,
# then summarises the earliest valid, and the latest valid date)
# Before this function there are 4225 observations, and after, only 482.
members_df_grouped <- members_df %>%
group_by( # Group by all variables apart from date
member_id,
name_display,
gender,
latest_constituency,
latest_party_id
) %>%
summarize( # Summarise the earliest and latest date each combination is valid for. This gives a range of values where the combination is duplicated
member_date_valid_min = min(member_date_valid),
member_date_valid_max = max(member_date_valid)
) %>%
mutate(
member_id = as.character(member_id),
name_display = as.character(name_display),
gender = as.character(gender),
latest_constituency = as.character(latest_constituency),
latest_party_id = as.character(latest_party_id),
member_date_valid_min = as.character(member_date_valid_min),
member_date_valid_max = as.character(member_date_valid_max)
)
#unique(members_df$member_id) %>% length() # This returns 474, indicating there are changes
dbWriteTable(db, "members", members_df_grouped, overwrite = TRUE)
# This code can also be run by sourcing scripts/04_pull-constituency-endpoints.R
MPs <- dbGetQuery(db,
"
SELECT *
FROM members
") %>%
replace_na_chr() # Convert character "NA" values back to real NAs
constituencies <- MPs$latest_constituency %>%
unique()
constituencies <-
data.frame(
constituency_id = constituencies
) %>%
mutate(
cons_name = NA,
cons_start_date = NA,
cons_end_date = NA,
last_election_1_electorate = NA,
last_election_1_turnout = NA,
last_election_1_majority = NA,
last_election_1_result = NA,
last_election_1_winning_party = NA,
last_election_1_election_ID = NA,
last_election_1_electionDate = NA,
last_election_1_isGeneralElection = NA,
last_election_2_electorate = NA,
last_election_2_turnout = NA,
last_election_2_majority = NA,
last_election_2_result = NA,
last_election_2_winning_party = NA,
last_election_2_election_ID = NA,
last_election_2_electionDate = NA,
last_election_2_isGeneralElection = NA,
last_election_3_electorate = NA,
last_election_3_turnout = NA,
last_election_3_majority = NA,
last_election_3_result = NA,
last_election_3_winning_party = NA,
last_election_3_election_ID = NA,
last_election_3_electionDate = NA,
last_election_3_isGeneralElection = NA,
last_election_4_electorate = NA,
last_election_4_turnout = NA,
last_election_4_majority = NA,
last_election_4_result = NA,
last_election_4_winning_party = NA,
last_election_4_election_ID = NA,
last_election_4_electionDate = NA,
last_election_4_isGeneralElection = NA,
shapefile = NA
)
### Pull basic details
pull_const_info <- function(cons_id) {
url <- paste0(
"https://members-api.parliament.uk/api/Location/Constituency/",
cons_id)
basic_info <- httr::GET(url) %>%
httr::content("parsed")
return(basic_info)
}
for(i in seq_along(constituencies$constituency_id)) {
response <- pull_const_info(constituencies$constituency_id[i])
response <- response[[1]]
constituencies$cons_name[i] <- response$name
constituencies$cons_start_date[i] <- response$startDate
constituencies$cons_end_date[i] <- ifelse(is.null(response$endDate), NA, response$endDate)
Sys.sleep(0.5)
print(paste0("LOG | Constituency API call - basic | ", Sys.time(), " | ", i, " of ", length(constituencies$constituency_id), " done"))
}
### Pull shape file
get_cons_shapefile <- function(cons_id) {
url <- paste0(
"https://members-api.parliament.uk/api/Location/Constituency/",
cons_id,
"/Geometry")
shapefile <- httr::GET(url) %>%
httr::content("parsed")
return(shapefile)
}
for(i in seq_along(constituencies$constituency_id)) {
response <- get_cons_shapefile(constituencies$constituency_id[i])
response <- response[[1]]
constituencies$shapefile[i] <- response
Sys.sleep(0.5)
print(paste0("LOG | Constituency API call - shapefile | ", Sys.time(), " | ", i, " of ", length(constituencies$constituency_id), " done"))
}
### Pull election results
get_cons_election_results <- function(cons_id) {
url <- paste0(
"https://members-api.parliament.uk/api/Location/Constituency/",
cons_id,
"/ElectionResults")
results <- httr::GET(url) %>%
httr::content("parsed")
return(results)
}
for (i in seq_along(constituencies$constituency_id)) {
response <- get_cons_election_results(constituencies$constituency_id[i])
response <- response[[1]]
response <- lapply(response, function(lst) {lapply(lst, replace_null_with_na)})
constituencies$last_election_1_electorate[i] <- response[[1]]$electorate
constituencies$last_election_1_turnout[i] <- response[[1]]$turnout
constituencies$last_election_1_majority[i] <- response[[1]]$majority
constituencies$last_election_1_result[i] <- response[[1]]$result
# If no winner recorded, skip this and assign NA
if(length(response[[1]]$winningParty) > 1) { # When there is content in the winning party sublist, the length will be greater than 1
constituencies$last_election_1_winning_party[i] <- response[[1]]$winningParty$id
} else {
constituencies$last_election_1_winning_party[i] <- NA
}
constituencies$last_election_1_election_ID[i] = response[[1]]$electionId
constituencies$last_election_1_electionDate[i] = response[[1]]$electionDate
constituencies$last_election_1_isGeneralElection[i] = response[[1]]$isGeneralElection
constituencies$last_election_2_electorate[i] <- response[[2]]$electorate
constituencies$last_election_2_turnout[i] <- response[[2]]$turnout
constituencies$last_election_2_majority[i] <- response[[2]]$majority
constituencies$last_election_2_result[i] <- response[[2]]$result
# If no winner recorded, skip this and assign NA
if(length(response[[2]]$winningParty) > 1) {
constituencies$last_election_2_winning_party[i] <- response[[2]]$winningParty$id
} else {
constituencies$last_election_2_winning_party[i] <- NA
}
constituencies$last_election_2_election_ID[i] = response[[2]]$electionId
constituencies$last_election_2_electionDate[i] = response[[2]]$electionDate
constituencies$last_election_2_isGeneralElection[i] = response[[2]]$isGeneralElection
constituencies$last_election_3_electorate[i] <- response[[3]]$electorate
constituencies$last_election_3_turnout[i] <- response[[3]]$turnout
constituencies$last_election_3_majority[i] <- response[[3]]$majority
constituencies$last_election_3_result[i] <- response[[3]]$result
# If no winner recorded, skip this and assign NA
if(length(response[[3]]$winningParty) > 1) {
constituencies$last_election_3_winning_party[i] <- response[[3]]$winningParty$id
} else {
constituencies$last_election_3_winning_party[i] <- NA
}
constituencies$last_election_3_election_ID[i] = response[[3]]$electionId
constituencies$last_election_3_electionDate[i] = response[[3]]$electionDate
constituencies$last_election_3_isGeneralElection[i] = response[[3]]$isGeneralElection
constituencies$last_election_4_electorate[i] <- response[[4]]$electorate
constituencies$last_election_4_turnout[i] <- response[[4]]$turnout
constituencies$last_election_4_majority[i] <- response[[4]]$majority
constituencies$last_election_4_result[i] <- response[[4]]$result
# If no winner recorded, skip this and assign NA
if(length(response[[4]]$winningParty) > 1) {
constituencies$last_election_4_winning_party[i] <- response[[4]]$winningParty$id
} else {
constituencies$last_election_4_winning_party[i] <- NA
}
constituencies$last_election_4_election_ID[i] = response[[4]]$electionId
constituencies$last_election_4_electionDate[i] = response[[4]]$electionDate
constituencies$last_election_4_isGeneralElection[i] = response[[4]]$isGeneralElection
Sys.sleep(0.5)
print(paste0("LOG | Constituency API call - elections | ", Sys.time(), " | ", i, " of ", length(constituencies$constituency_id), " done"))
}
saveRDS(constituencies, "data/constituencies_api_raw.Rds")
# This code can also be run by sourcing scripts/04_selinium-scrape-HoC-dashboard.R
# NOTE cons_hoc returns 610 not 472 because it was pulled based on constituencies in all oral questions, not just 2023
# Read in data from the constituency endpoint pull
cons <- readRDS("data/constituencies_api_raw.Rds")
# Make new dataframe
cons <- cons %>%
select(constituency_id, cons_name) %>%
unique() %>% # Keep only unique constituencies
mutate( # Initialise variables
region_nation_hoclib23 = NA,
population_hoclib23 = NA,
area_hoclib23 = NA,
age_0_29_hoclib23 = NA,
age_30_64_hoclib23 = NA,
age_65_plus_hoclib23 = NA,
uc_claimants_hoclib23 = NA,
median_house_price_hoclib23 = NA
)
# Check whether constituencies have already been pulled and saved. If they have, filter out these so they are not re-scraped.
# If running for the first time, you will not be able to read in cons_hoc, so the filtering is skipped.
if (file.exists("data/hoc_library_scrape_raw.Rds")) {
cons_hoc <- readRDS("data/hoc_library_scrape_raw.Rds")
cons$check_already_pulled <- cons$cons_name %in% cons_hoc$cons_name
cons <- cons %>% filter(check_already_pulled == FALSE)
}
# Set up the Selenium browser
rD <- rsDriver(browser=c("firefox"), verbose = F, port = netstat::free_port(random = TRUE), chromever = NULL)
driver <- rD$client
# Define a list of css selectors
# The dashboard is contained within an "iframe".
# This embeds a separate html tree within the main html of the webpage, meaning CSS paths do not point to the actual path of the page.
# We therefore need to identify the iframe and use `switchToFrame()` before locating elements on the dashboard.
selector_list <- list()
selector_list$search_dropdown <- "/html/body/div[1]/report-embed/div/div[1]/div/div/div/div/exploration-container/div/div/docking-container/div/div/div/div/exploration-host/div/div/exploration/div/explore-canvas/div/div[2]/div/div[2]/div[2]/visual-container-repeat/visual-container[1]/transform/div/div[3]/div/div/visual-modern/div/div/div[2]/div/i"
selector_list$search_box <- "/html/body/div[7]/div[1]/div/div[1]/input"
selector_list$search_result <- "/html/body/div[7]/div[1]/div/div[2]/div/div[1]/div/div/div[1]/div/span"
selector_list$region_nation <- "/html/body/div[1]/report-embed/div/div[1]/div/div/div/div/exploration-container/div/div/docking-container/div/div/div/div/exploration-host/div/div/exploration/div/explore-canvas/div/div[2]/div/div[2]/div[2]/visual-container-repeat/visual-container[2]/transform/div/div[3]/div/div/visual-modern/div/div/div/div[1]/div/div/div/div/div"
selector_list$population <- "/html/body/div[1]/report-embed/div/div[1]/div/div/div/div/exploration-container/div/div/docking-container/div/div/div/div/exploration-host/div/div/exploration/div/explore-canvas/div/div[2]/div/div[2]/div[2]/visual-container-repeat/visual-container[3]/transform/div/div[3]/div/div/visual-modern/div/div/div/p[2]/span"
selector_list$area <- "/html/body/div[1]/report-embed/div/div[1]/div/div/div/div/exploration-container/div/div/docking-container/div/div/div/div/exploration-host/div/div/exploration/div/explore-canvas/div/div[2]/div/div[2]/div[2]/visual-container-repeat/visual-container[5]/transform/div/div[3]/div/div/visual-modern/div/div/div/p[2]/span"
selector_list$age_0_29 <- "/html/body/div[1]/report-embed/div/div[1]/div/div/div/div/exploration-container/div/div/docking-container/div/div/div/div/exploration-host/div/div/exploration/div/explore-canvas/div/div[2]/div/div[2]/div[2]/visual-container-repeat/visual-container[11]/transform/div/div[3]/div/div/visual-modern/div/div/div/div[1]/div/div/div/div/div/div[1]"
selector_list$age_30_64 <- "/html/body/div[1]/report-embed/div/div[1]/div/div/div/div/exploration-container/div/div/docking-container/div/div/div/div/exploration-host/div/div/exploration/div/explore-canvas/div/div[2]/div/div[2]/div[2]/visual-container-repeat/visual-container[13]/transform/div/div[3]/div/div/visual-modern/div/div/div/div[1]/div/div/div/div/div/div[1]"
selector_list$age_65_plus <- "/html/body/div[1]/report-embed/div/div[1]/div/div/div/div/exploration-container/div/div/docking-container/div/div/div/div/exploration-host/div/div/exploration/div/explore-canvas/div/div[2]/div/div[2]/div[2]/visual-container-repeat/visual-container[15]/transform/div/div[3]/div/div/visual-modern/div/div/div/div[1]/div/div/div/div/div/div[1]"
selector_list$uc_claimants <- "/html/body/div[1]/report-embed/div/div[1]/div/div/div/div/exploration-container/div/div/docking-container/div/div/div/div/exploration-host/div/div/exploration/div/explore-canvas/div/div[2]/div/div[2]/div[2]/visual-container-repeat/visual-container[28]/transform/div/div[3]/div/div/visual-modern/div/div/div/div[1]/div/div/div/div/div[1]/div[1]"
selector_list$house_prices <- "/html/body/div[1]/report-embed/div/div[1]/div/div/div/div/exploration-container/div/div/docking-container/div/div/div/div/exploration-host/div/div/exploration/div/explore-canvas/div/div[2]/div/div[2]/div[2]/visual-container-repeat/visual-container[39]/transform/div/div[3]/div/div/visual-modern/div/div/div/div[1]/div/div/div/div/div[2]/div[1]"
constituency_dash_scraper <- function(
constituency_name,
wait_base = 1 # Allows user to adjust wait lengths (e.g if running on a slow connection)
# If you get a 'could not find element' error, try adjusting the wait time as the dashboard takes a while to load
){
# Find dropdown box and click on it
search_dropdown <- driver$findElement(using = "xpath", value = selector_list$search_dropdown)
search_dropdown$clickElement()
# Find search box and type constituency name
Sys.sleep(wait_base * 2)
search_box <- driver$findElement(using = "xpath", value = selector_list$search_box)
#search_box$clickElement() # Do not strictly need this, but if not working try uncommenting
search_box$clearElement()
search_box$sendKeysToElement(list(constituency_name))
Sys.sleep(wait_base * 4) # This requires a long time to load.
# Click on the first result to load data
first_result <- driver$findElement(using = "xpath", value = selector_list$search_result)
first_result$clickElement()
Sys.sleep(wait_base * 4) # Wait for data to load
# EXTRACT TEXT FROM ELEMENTS
# Set defaults as NA
region_nation_text <- NA
population_text <- NA
area_text <- NA
age_0_29_text <- NA
age_30_64_text <- NA
age_65_plus_text <- NA
uc_claimants_text <- NA
house_prices_text <- NA
# Region or nation
tryCatch({ # Prevent loop from closing if no data available
suppressMessages({
region_nation <- driver$findElement(using = "xpath", value = selector_list$region_nation)
region_nation_text <- region_nation$getElementText()[[1]]
})
}, error = function(e) {
# Print error message, no need to assign NA as we have set NA as default
print(paste0("Log: NA assigned for REGION/NATION at iteration ", i))
})
# Population
tryCatch({
suppressMessages({
population <- driver$findElement(using = "xpath", value = selector_list$population)
population_text <- population$getElementText()[[1]]
})
}, error = function(e) {
print(paste0("Log: NA assigned for POPULATION at iteration ", i))
})
# Area in sq km
tryCatch({
suppressMessages({
area <- driver$findElement(using = "xpath", value = selector_list$area)
area_text <- area$getElementText()[[1]]
})
}, error = function(e) {
print(paste0("Log: NA assigned for AREA at iteration ", i))
})
# Age composition
tryCatch({
suppressMessages({
age_0_29 <- driver$findElement(using = "xpath", value = selector_list$age_0_29)
age_0_29_text <- age_0_29$getElementText()[[1]]
})
}, error = function(e) {
print(paste0("Log: NA assigned for AGE 0-29 at iteration ", i))
})
tryCatch({
suppressMessages({
age_30_64 <- driver$findElement(using = "xpath", value = selector_list$age_30_64)
age_30_64_text <- age_30_64$getElementText()[[1]]
})
}, error = function(e) {
print(paste0("Log: NA assigned for AGE 30-64 at iteration ", i))
})
tryCatch({
suppressMessages({
age_65_plus <- driver$findElement(using = "xpath", value = selector_list$age_65_plus)
age_65_plus_text <- age_65_plus$getElementText()[[1]]
})
}, error = function(e) {
print(paste0("Log: NA assigned for AGE 65 PLUS at iteration ", i))
})
# Universal credit claimants
tryCatch({
suppressMessages({
uc_claimants <- driver$findElement(using = "xpath", value = selector_list$uc_claimants)
uc_claimants_text <- uc_claimants$getElementText()[[1]]
})
}, error = function(e) {
print(paste0("Log: NA assigned for UC CLAIMANTS at iteration ", i))
})
# House price
tryCatch({
suppressMessages({
house_prices <- driver$findElement(using = "xpath", value = selector_list$house_prices)
house_prices_text <- house_prices$getElementText()[[1]]
})
}, error = function(e) {
print(paste0("Log: NA assigned for HOUSE PRICE at iteration ", i))
})
# Merge results into a list
results = list(
region_nation_text,
population_text, area_text,
age_0_29_text, age_30_64_text, age_65_plus_text,
uc_claimants_text, house_prices_text)
return(results)
}
# Run the scraper
# Navigate to home page outside of the loop to avoid reloading each time
driver$navigate("https://commonslibrary.parliament.uk/constituency-dashboard/")
Sys.sleep(1)
# The dashboard exists within a sub-page (iframe). Unless we "switch" to this frame, the XPath selectors will not resolve
# Identify and switch to sub-page
iframe <- driver$findElement(using = "xpath", value = "//iframe[@title='Constituency dashboard']")
driver$switchToFrame(iframe)
Sys.sleep(4)
# Set the number to start from in case the loop is interrupted but we have cached results
start_from = 1
for (i in start_from:length(cons$constituency_id)) {
results <- constituency_dash_scraper(cons$cons_name[i], wait_base = 1)
cons$region_nation_hoclib23[i] <- results[[1]]
cons$population_hoclib23[i] <- results[[2]]
cons$area_hoclib23[i] <- results[[3]]
cons$age_0_29_hoclib23[i] <- results[[4]]
cons$age_30_64_hoclib23[i] <- results[[5]]
cons$age_65_plus_hoclib23[i] <- results[[6]]
cons$uc_claimants_hoclib23[i] <- results[[7]]
cons$median_house_price_hoclib23[i] <- results[[8]]
# Cache results collected so far
saveRDS(cons, paste0("data/cache_cons_at", i, ".Rds"))
if (i > start_from) {
file.remove(paste0("data/cache_cons_at", i-1, ".Rds")) # delete previous cached object
}
Sys.sleep(1)
print(paste0(i, " of ", nrow(cons), " done."))
}
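The save-then-delete caching inside the loop can be factored into a small helper. This is an illustrative sketch, not part of the pipeline above: the function name `cache_step` and its `dir` argument are hypothetical.

```r
# Hypothetical helper: save the current state, then drop the previous cache file
cache_step <- function(obj, i, start_from, dir = "data") {
  path <- file.path(dir, paste0("cache_cons_at", i, ".Rds"))
  saveRDS(obj, path) # cache progress so far
  prev <- file.path(dir, paste0("cache_cons_at", i - 1, ".Rds"))
  if (i > start_from && file.exists(prev)) file.remove(prev) # keep only the latest cache
  invisible(path)
}
```

Keeping only the latest file avoids accumulating one `.Rds` per constituency while still allowing a restart from the last completed iteration.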
# Kill driver and java processes
driver$close()
rD$server$stop()
system("taskkill /im java.exe /f", intern=FALSE, ignore.stdout=FALSE) # Windows-specific; fails harmlessly with a "not found" message on Linux/macOS
if (file.exists("data/hoc_library_scrape_raw.Rds")) {
cons$check_already_pulled <- NULL
saveRDS(cons, "data/hoc_library_scrape_raw_extra.Rds")
cons_hoc <- readRDS("data/hoc_library_scrape_raw.Rds")
cons <- rbind(cons_hoc, cons)
saveRDS(cons, "data/hoc_library_scrape_raw.Rds")
} else {
# Save output
saveRDS(cons, "data/hoc_library_scrape_raw.Rds")
}
# Clean dashboard data
cons_hoc <- readRDS("data/hoc_library_scrape_raw.Rds")
# pop numeric
cons_hoc$population_hoclib23 <- cons_hoc$population_hoclib23 %>%
str_remove_all(",") %>%
as.numeric()
# area numeric
cons_hoc$area_hoclib23 <- cons_hoc$area_hoclib23 %>%
str_extract(".*(?=\\s*sq\\.\\s*km)") %>%
str_remove_all(",") %>%
as.numeric()
# age perc
cons_hoc$age_0_29_hoclib23 <- cons_hoc$age_0_29_hoclib23 %>%
str_remove_all("%") %>%
as.numeric()
cons_hoc$age_0_29_hoclib23 <- cons_hoc$age_0_29_hoclib23/100 # Convert to proportion
cons_hoc$age_30_64_hoclib23 <- cons_hoc$age_30_64_hoclib23 %>%
str_remove_all("%") %>%
as.numeric()
cons_hoc$age_30_64_hoclib23 <- cons_hoc$age_30_64_hoclib23/100 # Convert to proportion
cons_hoc$age_65_plus_hoclib23 <- cons_hoc$age_65_plus_hoclib23 %>%
str_remove_all("%") %>%
as.numeric()
cons_hoc$age_65_plus_hoclib23 <- cons_hoc$age_65_plus_hoclib23/100 # Convert to proportion
# uc numeric
cons_hoc$uc_claimants_hoclib23 <- cons_hoc$uc_claimants_hoclib23 %>%
str_remove_all(",") %>%
as.numeric()
# house price numeric
cons_hoc$median_house_price_hoclib23 <- cons_hoc$median_house_price_hoclib23 %>%
str_remove_all(",|£") %>%
as.numeric()
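As a sanity check on the cleaning patterns above, here is how they behave on made-up values in the formats the dashboard displays (the example strings are illustrative, not scraped):

```r
library(stringr)
library(magrittr)

# Illustrative raw strings in the dashboard's display formats
pop <- as.numeric(str_remove_all("1,234,567", ","))   # thousands separators stripped
area <- "1,096 sq. km" %>%
  str_extract(".*(?=\\s*sq\\.\\s*km)") %>%            # lookahead keeps only the text before "sq. km"
  str_remove_all(",") %>%
  as.numeric()
price <- as.numeric(str_remove_all("£250,000", ",|£")) # drop currency symbol and separators
```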
# Merge API and dashboard data
cons_api <- readRDS("data/constituencies_api_raw.Rds")
cons_hoc <- cons_hoc %>%
select(-cons_name)
cons <- left_join(cons_api, cons_hoc, by = "constituency_id")
# Write out to database
dbWriteTable(db, "constituencies", cons, overwrite = TRUE)
response <- httr::GET("https://members-api.parliament.uk/api/Parties/GetActive/1") %>%
httr::content("parsed")
parties <- response$items
parties <- replace_null_with_na(parties)
parties_df <- lapply(parties, function(party) {
data.frame(
party_id = party$value$id,
party_name = party$value$name,
party_abbreviation = party$value$abbreviation,
party_colour = party$value$backgroundColour
)
}) %>%
bind_rows() # build one row per party, then bind once (avoids growing a data frame with rbind)
parties_df <- parties_df %>%
mutate(
party_id = as.character(party_id),
party_name = as.character(party_name),
party_abbreviation = as.character(party_abbreviation),
party_colour = as.character(party_colour))
dbWriteTable(db, "parties", parties_df, overwrite = TRUE)
question_text <- dbGetQuery(db,
"
SELECT question_id, question_text
FROM questions
"
) %>%
replace_na_chr() # When flattening the API responses, we replaced null values with character "NA" values, this function converts them back to NAs that R can recognise.
# Measure
# Initialise variables as NA
question_text$is_econ <- NA
question_text$is_health_welf <- NA
# clean question text
question_text$question_text <- question_text$question_text %>%
tolower() %>% # Convert to lower case
tm::removePunctuation() # remove punctuation
# Define dictionaries
# NOTE Citation: Albaugh, Quinn, Julie Sevenans and Stuart Soroka. 2013. Lexicoder Topic Dictionaries, June 2013 versions, McGill University, Montreal, Canada.
# Download Lexicon Policy Topic Dictionaries
if (!file.exists("data/lexicoder_dictionaries/LTDjun2013/policy_agendas_english.lcd")) {
dir.create("data/lexicoder_dictionaries")
download.file(
"https://www.snsoroka.com/s/LTDjun2013.zip",
"data/lexicoder_dictionaries/policy_topics.zip")
unzip(
"data/lexicoder_dictionaries/policy_topics.zip",
exdir = "data/lexicoder_dictionaries", overwrite = TRUE)
}
# NOTE ChatGPT used to write code that parses XML
parsed_string <- readLines("data/lexicoder_dictionaries/LTDjun2013/policy_agendas_english.lcd") %>%
paste(collapse = "\n") %>%
xmlTreeParse(useInternalNodes = TRUE)
extract_pnodes <- function(cnode) {
sapply(xpathApply(cnode, "./pnode"), function(pnode) {
xmlAttrs(pnode)[["name"]]
})
}
# Extract cnodes and their corresponding pnodes
dictionaries_output <- xpathApply(parsed_string, "//cnode", function(cnode) {
cnode_name <- xmlAttrs(cnode)[["name"]]
pnodes <- extract_pnodes(cnode)
return(list(cnode_name = cnode_name, pnodes = pnodes))
})
# Convert to a single list
dictionaries <- list()
for (item in dictionaries_output) {
cnode_name <- item$cnode_name
pnodes <- item$pnodes
dictionaries[[cnode_name]] <- pnodes
}
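The `.lcd` file nests keyword entries (`pnode`) under topic categories (`cnode`). A toy instance (with made-up keywords, not the Lexicoder content) shows what the XPath extraction above recovers:

```r
library(XML)

# Minimal stand-in for the .lcd structure: one category holding two keywords
toy <- '<dict><cnode name="macroeconomics"><pnode name="inflation"/><pnode name="gdp"/></cnode></dict>'
doc <- xmlTreeParse(toy, useInternalNodes = TRUE)
# Pull the "name" attribute of every keyword node under a category node
keywords <- sapply(xpathApply(doc, "//cnode/pnode"), function(p) xmlAttrs(p)[["name"]])
```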
rm(cnode_name, dictionaries_output, item, parsed_string, pnodes)
econ_dict <- c( # Select relevant dictionaries
dictionaries$macroeconomics,
dictionaries$finance,
dictionaries$foreign_trade
)
# Convert to regex string and convert to lower for matching
econ_dict <- econ_dict %>%
paste(collapse="|") %>%
tolower()
health_welf_dict <- c( # Select relevant dictionaries
dictionaries$healthcare,
dictionaries$social_welfare
)
health_welf_dict <- health_welf_dict %>%
paste(collapse="|") %>%
tolower()
question_text <- question_text %>%
mutate( # is_econ / is_health_welf were already initialised as NA above
is_econ =
ifelse(
str_detect(question_text, econ_dict), 1, 0),
is_health_welf =
ifelse(
str_detect(question_text, health_welf_dict), 1, 0)
)
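The classification reduces each topic to a single alternation regex; a minimal illustration with a made-up three-word dictionary:

```r
library(stringr)

# Toy dictionary (not the Lexicoder entries), collapsed to one regex
toy_dict <- paste(c("inflation", "interest rate", "gdp"), collapse = "|")
# A question matches the topic if any dictionary term appears in its text
hits <- str_detect(
  c("what assessment has he made of recent inflation",
    "how many patients are waiting for treatment"),
  toy_dict
)
```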
# Share of questions matched to each topic dictionary
mean(question_text$is_econ)
mean(question_text$is_health_welf)
dbWriteTable(db, "question_topics", question_text, overwrite = TRUE)
dbListTables(db)
db_table_check("questions")
db_table_check("members")
db_table_check("constituencies")
db_table_check("parties")
db_table_check("question_topics")
analysis_df <- dbGetQuery(
db,
"
SELECT
members.name_display AS MP,
parties.party_abbreviation AS party_abbreviation,
SUM(question_topics.is_econ)/COUNT(*) AS econ_prop,
SUM(question_topics.is_health_welf)/COUNT(*) AS health_welf_prop,
constituencies.uc_claimants_hoclib23 AS uc_claimants,
constituencies.median_house_price_hoclib23 AS median_house_price,
constituencies.population_hoclib23 / constituencies.area_hoclib23 AS density,
constituencies.age_0_29_hoclib23 AS age_29,
constituencies.age_30_64_hoclib23 AS age_30_64,
constituencies.age_65_plus_hoclib23 AS age_65,
/* Majority */
constituencies.last_election_1_majority,
constituencies.last_election_2_majority,
constituencies.last_election_3_majority,
constituencies.last_election_4_majority,
constituencies.last_election_1_electorate,
constituencies.last_election_2_electorate,
constituencies.last_election_3_electorate,
constituencies.last_election_4_electorate,
/* results */
constituencies.last_election_1_result,
constituencies.last_election_2_result,
constituencies.last_election_3_result,
constituencies.last_election_4_result
FROM questions
JOIN question_topics ON questions.question_id = question_topics.question_id
LEFT JOIN members ON questions.asking_member = members.member_id
/* this has to be joined before anything */
/* from members to avoid dropping rows */
/* select row where date of question comes between the dates valid range */
AND REPLACE(questions.question_tabled_when, '-', '')
/* no date class in SQLite, so convert to string*/
BETWEEN REPLACE(members.member_date_valid_min, '-', '')
AND REPLACE(members.member_date_valid_max, '-', '')
LEFT JOIN constituencies ON members.latest_constituency = constituencies.constituency_id
LEFT JOIN parties ON parties.party_id = members.latest_party_id
GROUP BY members.member_id
"
) %>%
replace_na_chr() # When flattening the API responses, we replaced null values with character "NA" values, this function converts them back to NAs that R can recognise.
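The validity-window join relies on zero-padded ISO dates comparing correctly as strings once the hyphens are stripped. A self-contained sketch with an in-memory SQLite database and made-up validity spans:

```r
library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "spans", data.frame(
  label = c("first half", "second half"),
  date_min = c("2023-01-01", "2023-07-01"),
  date_max = c("2023-06-30", "2023-12-31")
))
# Pick the span whose validity range contains the question date
hit <- dbGetQuery(con, "
  SELECT label FROM spans
  WHERE REPLACE('2023-03-15', '-', '')
    BETWEEN REPLACE(date_min, '-', '') AND REPLACE(date_max, '-', '')
")
dbDisconnect(con)
```

Strictly, ISO-formatted dates also compare correctly as strings with the hyphens left in, so the `REPLACE` calls are belt-and-braces rather than essential.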
# Convert majority variables into +/- depending on whether current MP won or lost
analysis_df <- analysis_df %>%
mutate(
last_election_1_majority =
ifelse(str_detect(last_election_1_result, party_abbreviation),
last_election_1_majority,
last_election_1_majority * -1),
last_election_2_majority =
ifelse(str_detect(last_election_2_result, party_abbreviation),
last_election_2_majority,
last_election_2_majority * -1),
last_election_3_majority =
ifelse(str_detect(last_election_3_result, party_abbreviation),
last_election_3_majority,
last_election_3_majority * -1),
last_election_4_majority =
ifelse(str_detect(last_election_4_result, party_abbreviation),
last_election_4_majority,
last_election_4_majority * -1)
) %>%
select(-c(last_election_1_result, last_election_2_result, last_election_3_result, last_election_4_result))
# Calculate marginality
analysis_df <- analysis_df %>%
mutate(
marginality_1 = (last_election_1_majority / last_election_1_electorate),
marginality_2 = (last_election_2_majority / last_election_2_electorate),
marginality_3 = (last_election_3_majority / last_election_3_electorate),
marginality_4 = (last_election_4_majority / last_election_4_electorate)
) %>%
select(-c(last_election_1_majority, last_election_2_majority, last_election_3_majority, last_election_4_majority, last_election_1_electorate, last_election_2_electorate,last_election_3_electorate,last_election_4_electorate))
# Calculate mean marginality
analysis_df <- analysis_df %>%
mutate(
mean_marginality = rowMeans(select(., starts_with("marginality_")), na.rm = TRUE) # na.rm so seats missing an election still get a mean
) %>%
select(-c(marginality_1, marginality_2, marginality_3, marginality_4))
analysis_df <- analysis_df %>%
mutate(econ_slant = econ_prop - health_welf_prop)
hist <- analysis_df %>%
ggplot() +
geom_vline(xintercept = 0)+
geom_histogram(aes(x=econ_slant), alpha = 0.7) +
xlim(-1,1)+
labs(title = "Distribution of economic slant") +
theme(
aspect.ratio = 1,
panel.background = element_rect(fill = "white", color = "black"),
panel.grid = element_blank())
hist
geog_data <- dbGetQuery(db,
"
SELECT
constituencies.cons_name AS constituency,
SUM(question_topics.is_econ)/COUNT(*) AS econ_prop,
SUM(question_topics.is_health_welf)/COUNT(*) AS health_welf_prop,
constituencies.shapefile AS con_shapefile
FROM questions
LEFT JOIN members ON questions.asking_member = members.member_id
/* this has to be joined before anything */
/* from members to avoid dropping rows */
/* select row where date of question comes between the dates valid range */
AND REPLACE(questions.question_tabled_when, '-', '')
/* no date class in SQLite, so convert to string*/
BETWEEN REPLACE(members.member_date_valid_min, '-', '')
AND REPLACE(members.member_date_valid_max, '-', '')
LEFT JOIN constituencies ON members.latest_constituency = constituencies.constituency_id
LEFT JOIN question_topics ON questions.question_id = question_topics.question_id
GROUP BY constituencies.cons_name
"
) %>%
replace_na_chr() %>% # When flattening the API responses, we replaced null values with character "NA" values, this function converts them back to NAs that R can recognise.
filter(!is.na(con_shapefile)) # Drop constituencies without shapefiles
geog_data <- geog_data %>%
mutate(econ_slant = econ_prop - health_welf_prop)
# Add all constituencies as base map
# To plot the base map, we want all constituencies, not just ones where questions have been asked.
# We can get this from OSMaps
# For reproducibility, the following code downloads and processes the data programmatically
# To download manually, use https://osdatahub.os.uk/downloads/open/BoundaryLine
if (!file.exists("data/whole_UK_shapefile/Data/GB/westminster_const_region.shp")) {
options(timeout=600) # Takes some time to download so we need to increase the "timeout" setting
dir.create("data/whole_UK_shapefile")
download.file("https://api.os.uk/downloads/v1/products/BoundaryLine/downloads?area=GB&format=ESRI%C2%AE+Shapefile&redirect", "data/whole_UK_shapefile/OS_zip.zip") #
options(timeout=60) # Reset timeout
# unzip
unzip(
"data/whole_UK_shapefile/OS_zip.zip",
files = c(
"Data/GB/westminster_const_region.dbf",
"Data/GB/westminster_const_region.prj",
"Data/GB/westminster_const_region.shp",
"Data/GB/westminster_const_region.shx"),
exdir = "data/whole_UK_shapefile",
overwrite = TRUE)
file.remove("data/whole_UK_shapefile/OS_zip.zip")
}
basemap_sf <-
st_read(
dsn = "data/whole_UK_shapefile/Data/GB/westminster_const_region.shp")
# Make shape files for constituencies in the database
# NOTE: approach to converting from GeoJSON from ChatGPT
temp_geojson <- tempfile(fileext = ".geojson") # Create a temporary file
writeLines(geog_data$con_shapefile, con = temp_geojson) # Write out to temporary file
geog_sf <- st_read(dsn = temp_geojson) # Read the GeoJSON file into an sf object
unlink(temp_geojson) # Delete temporary file
geog_data <- cbind(geog_data, geog_sf)
geog_data <- st_as_sf(geog_data) # Convert to SF
# Plot bias
bias_map <-
tm_shape(basemap_sf) +
tm_sf(col = "white") +
tm_shape(geog_data) +
tm_polygons(
col = "econ_slant",
style = "cont",
midpoint = 0,
title = "Economic slant",
palette = "RdBu",
#legend.hist = TRUE
) +
tm_legend(
legend.outside=TRUE
)
bias_map
# Run linear model
lm_marg <- summary(lm(econ_slant ~ mean_marginality, data = analysis_df))
slope_marg <- lm_marg$coefficients[[2,1]]
se_marg <- lm_marg$coefficients[[2,2]]
# Make plot
plot_marginality <- analysis_df %>%
ggplot(aes(x=mean_marginality, y = econ_slant)) +
geom_point(size = 0.5, alpha = 0.5) +
geom_smooth(method = "lm") +
annotate( # annotate() draws the label once; geom_text() would redraw it for every row
"text", x = -0.2, y = 0.5,
label = paste0(
"Slope: ", signif(slope_marg, digits = 3),
"\n SE: ", signif(se_marg, digits = 3)
),
size = 3)+
ylim(-1,1) +
theme(
aspect.ratio = 1,
panel.background = element_rect(fill = "white", color = "black"),
panel.grid = element_blank())
plot_marginality
# Run linear model
lm_uc <- summary(lm(econ_slant ~ uc_claimants, data = analysis_df))
# Save slope coefficients and SEs for plotting
slope_uc <- lm_uc$coefficients[[2,1]]
se_uc <- lm_uc$coefficients[[2,2]]
# Make plots
plot_uc <- analysis_df %>%
ggplot(aes(x=uc_claimants, y = econ_slant)) +
geom_point(size = 0.5, alpha = 0.5) +
geom_smooth(method = "lm") +
annotate( # x = -Inf pins the label to the left panel edge; -0.2 lies outside the claimant-count range
"text", x = -Inf, y = 0.5, hjust = -0.1,
label = paste0(
"Slope: ", signif(slope_uc, digits = 3),
"\n SE: ", signif(se_uc, digits = 3)
),
size = 3)+
ylim(-1,1) +
theme(
aspect.ratio = 1,
panel.background = element_rect(fill = "white", color = "black"),
panel.grid = element_blank())
plot_uc
# Run linear models
lm_house_price <- summary(lm(econ_slant ~ median_house_price, data = analysis_df))
# Save slope coefficients and SEs for plotting
slope_house_price <- lm_house_price$coefficients[[2,1]]
se_house_price <- lm_house_price$coefficients[[2,2]]
# Make plot
plot_house_price <- analysis_df %>%
ggplot(aes(x=median_house_price, y = econ_slant)) +
geom_point(size = 0.5, alpha = 0.5) +
geom_smooth(method = "lm") +
annotate( # draws the label once rather than once per row
"text", x = 9*10^5, y = 0.5,
label = paste0(
"Slope: ", signif(slope_house_price, digits = 3),
"\n SE: ", signif(se_house_price, digits = 3)
),
size = 3) +
ylim(-1,1) +
theme(
aspect.ratio = 1,
panel.background = element_rect(fill = "white", color = "black"),
panel.grid = element_blank())
plot_house_price
## Under 29
# Run linear model
lm_age_29 <- summary(lm(econ_slant ~ age_29, data = analysis_df))
# Save slope coefficients and SEs for plotting
slope_age_29 <- lm_age_29$coefficients[[2,1]]
se_age_29 <- lm_age_29$coefficients[[2,2]]
# Make plots
plot_age_29 <- analysis_df %>%
ggplot(aes(x = age_29, y = econ_slant)) +
geom_point(size = 0.5, alpha = 0.5) +
geom_smooth(method = "lm") +
annotate( # draws the label once rather than once per row
"text", x = 0.5, y = 0.5,
label = paste0(
"Slope: ", signif(slope_age_29, digits = 3),
"\n SE: ", signif(se_age_29, digits = 3)
),
size = 3)+
ylim(-1,1) +
theme(
aspect.ratio = 1,
panel.background = element_rect(fill = "white", color = "black"),
panel.grid = element_blank())
## 30-65
# Run linear model
lm_age_30_64 <- summary(lm(econ_slant ~ age_30_64, data = analysis_df))
# Save slope coefficients and SEs for plotting
slope_age_30_64 <- lm_age_30_64$coefficients[[2,1]]
se_age_30_64 <- lm_age_30_64$coefficients[[2,2]]
plot_age_30_64 <- analysis_df %>%
ggplot(aes(x = age_30_64, y = econ_slant)) +
geom_point(size = 0.5, alpha = 0.5) +
geom_smooth(method = "lm") +
annotate( # label uses the 30-64 model's own coefficients
"text", x = 0.5, y = 0.5,
label = paste0(
"Slope: ", signif(slope_age_30_64, digits = 3),
"\n SE: ", signif(se_age_30_64, digits = 3)
),
size = 3)+
ylim(-1,1) +
theme(
aspect.ratio = 1,
panel.background = element_rect(fill = "white", color = "black"),
panel.grid = element_blank())
## 65+
# Run linear model
lm_age_65 <- summary(lm(econ_slant ~ age_65 , data = analysis_df))
# Save slope coefficients and SEs for plotting
slope_age_65 <- lm_age_65$coefficients[[2,1]]
se_age_65 <- lm_age_65$coefficients[[2,2]]
plot_age_65 <- analysis_df %>%
ggplot(aes(x = age_65, y = econ_slant)) +
geom_point(size = 0.5, alpha = 0.5) +
geom_smooth(method = "lm") +
annotate( # label uses the 65+ model's own coefficients
"text", x = 0.3, y = 0.5,
label = paste0(
"Slope: ", signif(slope_age_65, digits = 3),
"\n SE: ", signif(se_age_65, digits = 3)
),
size = 3)+
ylim(-1,1) +
theme(
aspect.ratio = 1,
panel.background = element_rect(fill = "white", color = "black"),
panel.grid = element_blank())
plot_age_29
plot_age_30_64
plot_age_65
lm_density <- summary(lm(econ_slant ~ density, data = analysis_df))
slope_density <- lm_density$coefficients[[2,1]]
se_density <- lm_density$coefficients[[2,2]]
plot_density <- analysis_df %>%
ggplot(aes(x=density, y = econ_slant)) +
geom_point(size = 0.5, alpha = 0.5) +
geom_smooth(method = "lm") +
annotate( # draws the label once rather than once per row
"text", x = 15000, y = 0.5,
label = paste0(
"Slope: ", signif(slope_density, digits = 3),
"\n SE: ", signif(se_density, digits = 3)
),
size = 3)+
ylim(-1,1) +
theme(
panel.background = element_rect(fill = "white", color = "black"),
aspect.ratio = 1
)
plot_density
# Disconnect from local database
DBI::dbDisconnect(db)
sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
[5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
[7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
time zone: Europe/London
tzcode source: system (glibc)
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] gridExtra_2.3 rpart.plot_3.1.1 rpart_4.1.21 vip_0.4.1
[5] yardstick_1.2.0 workflowsets_1.0.1 workflows_1.1.3 tune_1.1.2
[9] rsample_1.2.0 recipes_1.0.9 parsnip_1.1.1 modeldata_1.2.0
[13] infer_1.0.5 dials_1.2.0 scales_1.3.0 broom_1.0.5
[17] tidymodels_1.1.1 ranger_0.16.0 ggrepel_0.9.4 sf_1.0-15
[21] tmap_3.3-4 XML_3.99-0.16 tm_0.7-11 NLP_0.2-1
[25] RSelenium_1.7.9 httr_1.4.7 RSQLite_2.3.4 DBI_1.1.3
[29] here_1.0.1 lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1
[33] dplyr_1.1.4 purrr_1.0.2 readr_2.1.4 tidyr_1.3.0
[37] tibble_3.2.1 ggplot2_3.4.4 tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] RColorBrewer_1.1-3 rstudioapi_0.15.0 jsonlite_1.8.8
[4] semver_0.2.0 magrittr_2.0.3 farver_2.1.1
[7] rmarkdown_2.25 vctrs_0.6.5 memoise_2.0.1
[10] base64enc_0.1-3 terra_1.7-65 htmltools_0.5.7
[13] leafsync_0.1.0 raster_3.6-26 parallelly_1.36.0
[16] KernSmooth_2.23-22 htmlwidgets_1.6.4 stars_0.6-4
[19] cachem_1.0.8 iterators_1.0.14 lifecycle_1.0.4
[22] pkgconfig_2.0.3 Matrix_1.6-3 R6_2.5.1
[25] fastmap_1.1.1 future_1.33.1 digest_0.6.33
[28] colorspace_2.1-0 furrr_0.3.1 wdman_0.2.6
[31] rprojroot_2.0.4 leafem_0.2.3 crosstalk_1.2.1
[34] labeling_0.4.3 lwgeom_0.2-13 fansi_1.0.5
[37] timechange_0.2.0 mgcv_1.9-0 abind_1.4-5
[40] compiler_4.3.2 proxy_0.4-27 bit64_4.0.5
[43] withr_2.5.2 backports_1.4.1 MASS_7.3-60
[46] lava_1.7.3 tmaptools_3.1-1 leaflet_2.2.1
[49] classInt_0.4-10 caTools_1.18.2 tools_4.3.2
[52] units_0.8-5 future.apply_1.11.1 nnet_7.3-19
[55] glue_1.6.2 nlme_3.1-163 grid_4.3.2
[58] generics_0.1.3 gtable_0.3.4 tzdb_0.4.0
[61] class_7.3-22 data.table_1.14.8 hms_1.1.3
[64] sp_2.1-2 xml2_1.3.6 utf8_1.2.4
[67] foreach_1.5.2 pillar_1.9.0 lhs_1.1.6
[70] splines_4.3.2 lattice_0.22-5 survival_3.5-7
[73] bit_4.0.5 tidyselect_1.2.0 knitr_1.45
[76] xfun_0.41 hardhat_1.3.0 timeDate_4032.109
[79] stringi_1.8.2 DiceDesign_1.10 yaml_2.3.7
[82] evaluate_0.23 codetools_0.2-19 cli_3.6.1
[85] munsell_0.5.0 dichromat_2.0-0.1 Rcpp_1.0.11
[88] globals_0.16.2 png_0.1-8 binman_0.1.3
[91] gower_1.0.1 assertthat_0.2.1 blob_1.2.4
[94] bitops_1.0-7 GPfit_1.0-8 listenv_0.9.0
[97] viridisLite_0.4.2 slam_0.1-50 ipred_0.9-14
[100] prodlim_2023.08.28 e1071_1.7-14 rlang_1.1.2